if (!requireNamespace("Seurat", quietly = TRUE))
install.packages("Seurat")
if (!requireNamespace("tidyverse", quietly = TRUE))
install.packages("tidyverse")
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
if (!requireNamespace("remotes", quietly = TRUE))
install.packages("remotes")
if (!requireNamespace("decoupleR", quietly = TRUE))
remotes::install_github('saezlab/decoupleR')
if (!requireNamespace("ComplexHeatmap", quietly = TRUE))
BiocManager::install("ComplexHeatmap")
if (!requireNamespace("glue", quietly = TRUE))
install.packages("glue")
if (!requireNamespace("colorBlindness", quietly = TRUE))
install.packages("colorBlindness")
if (!requireNamespace("ggpmisc", quietly = TRUE))
install.packages("ggpmisc")
if (!requireNamespace("circlize", quietly = TRUE))
install.packages("circlize")
library(Seurat)
library(tidyverse)
library(decoupleR)
library(ComplexHeatmap)
library(colorBlindness)
library(ggpmisc)
# Remember to set a seed so the analysis is reproducible!
set.seed(687)6- Gene Signatures - How to score & interpret them
Introduction
Gene signatures are commonly used in routine single cell analysis. Many methods exists but they are not all created equally. In this tutorial we are going to go follow a recent benchmarking paper Badia-i-Mompel et al. (2022) and follow their guidelines on best practices when scoring gene signatures!
With this tutorial we hope to familiarize you with the concepts of gene signatures, how they are scored in single cell datasets and how to interpret the scores obtained!
Before we start here are some key concepts that will help us and frame the vignette!
What is a gene signature?
A “gene signature” can be stated as a single or a group of genes in a cell having a unique pattern of gene expression that is the consequence of either changed biological process or altered pathogenic medical terms Mallik and Zhao (2018).
What is a cell type signature?
A cell type signature is a gene signature representing a group of genes underlying the biological processes characteristic of a cell type.
How do we score them in our dataset?
Scoring a gene signature means to obtain a value for that signature for each cell in our datasets that represents how active the gene program is in each cell. There are many ways to score gene signatures as shown in the
decoupleRpaper Badia-i-Mompel et al. (2022). However, they do not all perform the same and it is important to select a robust method. The suggested method after their benchmarking analysis is running a Univariate Linear Model (ULM) where the gene expression values are the response variable and the regulator weights in the gene signature are the explanatory one (don’t worry, we’ll go through this in more detail in a second). The obtained t-value from the fitted model is the activity score of that gene signature in that cell.How do we interpret that score?
Scoring gene signatures using Univariate Linear Models and using the resulting t-value as the scoring metric allows us to simultaneously interpret in a single statistic the direction of activity (either + or -) and its significance (the magnitude of the score).
Can we interrogate the scores obtained?
Yes! In fact it is very important to look past the score obtained by a cell and into which are the genes driving that score. Sometime with gene signatures containing 50 genes it could be that just a few genes are contributing to the signature. If we just stopped at the score we could be mislead into thinking that all of the genes making up the signature are important when it is actually only a fraction of them. Moreover, heterogeneous gene expression between two populations can also lead to 2 cells or populations having similar scores but vastly different genes gene programs underlying them.
Libraries
Load the libraries and install the packages needed to run this notebook
Load data
We’re going to be working with a dataset from the paper - Immunophenotyping of COVID-19 and influenza highlights the role of type I interferons in development of severe COVID-19 Download data from cellxgene portal.
# Download the data in data/ directory
# download.file(
# url = "https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds",
# destfile = "../data/workshop-data.rds",
# method = "wget",
# extra = "-r -p --random-wait")
# We can also use the CLI with the wget command below
# wget https://datasets.cellxgene.cziscience.com/d8e35450-de43-451a-9979-276eac688bce.rds
se <- readRDS("../data/workshop-data.rds")
# Remove Uncategorized
se <- se[, ! se$Celltype %in% c("Uncategorized1", "Uncategorized2")]
se$Celltype <- as.character(se$Celltype)Generate a color palette for our cell types
# https://www.datanovia.com/en/blog/easy-way-to-expand-color-palettes-in-r/
nb.cols <- length(unique(se$Celltype))
pal <- colorRampPalette(paletteMartin)(nb.cols)
names(pal) <- sort(unique(se$Celltype))Analysis
Convert ENSEMBL IDs to Gene Symbols
Right away we can see how ensembl ids are used in the rownames. Let’s transform them into their matched symbols to make them human-readable:
head(rownames(se))[1] "ENSG00000000003" "ENSG00000000005" "ENSG00000000419" "ENSG00000000457" "ENSG00000000460" "ENSG00000000938"
Convert to gene symbols
gene_df <- readr::read_csv(file = "../data/cov_flu_gene_names_table.csv")
symbol_id <- data.frame(index = rownames(se)) %>%
left_join(gene_df, by = "index") %>%
pull(feature_name)
# re-create seurat object
mtx <- se@assays$RNA$data
rownames(mtx) <- symbol_id
se <- CreateSeuratObject(counts = mtx, meta.data = se@meta.data)Preprocessing
We will do a quick preprocessing of the data. 1) log-normalize, 2) identify highly variable genes, 3) scale their expression and 4) compute PCA on the scaled data.
se <- se %>%
NormalizeData(verbose = FALSE) %>%
FindVariableFeatures(nfeatures = 3000, verbose = FALSE) %>%
ScaleData(verbose = FALSE) %>%
RunPCA(verbose = FALSE)Next we check the elbow plot to determine the number of PCs to use for the downstream analysis and then compute UMAP, K-nearest neighbor graph (KNN graph) and run Louvain clustering on it.
# Look at elbow plot to assess the number of PCs to use
ElbowPlot(se, ndims = 50)We can see a clear elbow at 10 PCs, we’re going to extend it a bit more and use 15 PCs for the downstream analysis to make sure we are not loosing any biological signal
se <- RunUMAP(se, reduction = "pca", dims = 1:30, verbose = FALSE)Next we compute the K-nearest-neighbor graph
# se <- se %>%
# FindNeighbors(verbose = FALSE) %>%
# FindClusters(resolution = c(0.05, 0.1, 0.125, 0.15, 0.2), verbose = FALSE)
# Visualize the Celltypes on the UMAP
# DimPlot(
# se,
# group.by = "Celltype")For the purpose of this tutorial we are going to proceed with resolution 0.15
(dim_plt <- DimPlot(
se,
group.by = "Celltype") +
scale_color_manual(values = pal))Gene Signature Scoring
Here we define some gene signatures based on prior knowledge. We are setting gene signature that are characteristic for specific cell types to score for their activities in our dataset.
bcell <- c("MS4A1", "CD79A", "CD79B", "BANK1", "HLA-DQB1", "HLA-DQA1")
tcell <- c("CD3D", "CD3E", "TRAC", "TRBC1", "TRBC2", "CD4", "CD8A", "CD8B")
tnaive <- c(tcell, "IL7R", "CCR7", "TCF7", "LEF1", "SELL")
cd8cyto <- c(tcell, "GZMA", "GZMK", "NKG7", "CCL5")
mono <- c("FCGR3A", "CD14", "S100A9", "S100A8", "MS4A7")
nks <- c("NCR1", "NCR2", "NCR3", "FCGR3A", "GZMA", "GZMK", "NKG7", "CCL5")We can see how there are some genes that are specific for each signature but others are shared between them. This is important to take into account when computing the gene signatures and interpreting their scores.
To help us compute these gene signatures we are going to use the R package decoupleR from Bioconductor. decoupleR is a great for carrying out these analysis since it is a framework that contains different statistical methods to compute these scores. Ultimately we will obtain a score for each signature for each cell.
decoupleR requires the gene signatures to be passed as a dataframe so we are going to convert our gene signature vectors into a unified dataframe. mor stands for Mode Of Regulation, at the moment since we don’t have a score of how important that gene is for that signature we are going to weight them all equally with a value of 1.
sig_ls <- list("B cells" = bcell, "T cells" = tcell, "Naive T cells" = tnaive,
"CD8 Cytotoxic" = cd8cyto, "Monocytes" = mono, "NKs" = nks)
sig_df <- lapply(names(sig_ls), function(i) {
data.frame(
signature = i,
gene = sig_ls[[i]],
mor = 1
)
}) %>% bind_rows()
sig_df signature gene mor
1 B cells MS4A1 1
2 B cells CD79A 1
3 B cells CD79B 1
4 B cells BANK1 1
5 B cells HLA-DQB1 1
6 B cells HLA-DQA1 1
7 T cells CD3D 1
8 T cells CD3E 1
9 T cells TRAC 1
10 T cells TRBC1 1
11 T cells TRBC2 1
12 T cells CD4 1
13 T cells CD8A 1
14 T cells CD8B 1
15 Naive T cells CD3D 1
16 Naive T cells CD3E 1
17 Naive T cells TRAC 1
18 Naive T cells TRBC1 1
19 Naive T cells TRBC2 1
20 Naive T cells CD4 1
21 Naive T cells CD8A 1
22 Naive T cells CD8B 1
23 Naive T cells IL7R 1
24 Naive T cells CCR7 1
25 Naive T cells TCF7 1
26 Naive T cells LEF1 1
27 Naive T cells SELL 1
28 CD8 Cytotoxic CD3D 1
29 CD8 Cytotoxic CD3E 1
30 CD8 Cytotoxic TRAC 1
31 CD8 Cytotoxic TRBC1 1
32 CD8 Cytotoxic TRBC2 1
33 CD8 Cytotoxic CD4 1
34 CD8 Cytotoxic CD8A 1
35 CD8 Cytotoxic CD8B 1
36 CD8 Cytotoxic GZMA 1
37 CD8 Cytotoxic GZMK 1
38 CD8 Cytotoxic NKG7 1
39 CD8 Cytotoxic CCL5 1
40 Monocytes FCGR3A 1
41 Monocytes CD14 1
42 Monocytes S100A9 1
43 Monocytes S100A8 1
44 Monocytes MS4A7 1
45 NKs NCR1 1
46 NKs NCR2 1
47 NKs NCR3 1
48 NKs FCGR3A 1
49 NKs GZMA 1
50 NKs GZMK 1
51 NKs NKG7 1
52 NKs CCL5 1
ULM
“Univariate Linear Model (ULM) fits a linear model for each sample and regulator, where the observed molecular readouts in mat are the response variable and the regulator weights in net are the explanatory one. Target features with no associated weight are set to zero. The obtained t-value from the fitted model is the activity ulm of a given regulator.”
Moreover, a nice thing about ulm is that in a single statistic it provides the direction of activity (either + or -) and its significance (the magnitude of the score). Making the scores very easy to interpret!
So lets compute the signature scores for every cell in our dataset!
ulm_start <- Sys.time()
res <- decoupleR::run_ulm(
mat = se@assays$RNA$data,
network = sig_df,
.source = "signature",
.target = "gene",
.mor = "mor")
glue::glue("Time to run ulm is {round(difftime(Sys.time(), ulm_start, units = 's'), 0)} seconds")Time to run ulm is 32 seconds
# remove densified memory created by decoupleR
gc() # gc stands for garbage collection used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 5888640 314.5 10363954 553.5 NA 10363954 553.5
Vcells 691412360 5275.1 5465424844 41697.9 98304 5700883714 43494.3
We can see how every cells has a score for every signature!
# Looking at the first 10 entries
res# A tibble: 349,194 × 5
statistic source condition score p_value
<chr> <chr> <chr> <dbl> <dbl>
1 ulm B cells AAACCCAAGAATGTTG-12 0.648 5.17e- 1
2 ulm B cells AAACCCAAGCATTGTC-19 -0.571 5.68e- 1
3 ulm B cells AAACCCAAGCTACGTT-6 -0.550 5.82e- 1
4 ulm B cells AAACCCAAGGCCGCTT-3 0.887 3.75e- 1
5 ulm B cells AAACCCAAGGCCTGCT-12 1.00 3.15e- 1
6 ulm B cells AAACCCAAGGGCAATC-1 16.4 4.37e-60
7 ulm B cells AAACCCAAGTAAGGGA-18 8.41 4.35e-17
8 ulm B cells AAACCCAAGTGTAGAT-5 0.730 4.65e- 1
9 ulm B cells AAACCCACAAGAGCTG-17 0.241 8.10e- 1
10 ulm B cells AAACCCACACAATGCT-5 0.460 6.46e- 1
# ℹ 349,184 more rows
# Check the cell type for cell AAACCCAAGGGCAATC-1
se@meta.data[c("AAACCCAAGGGCAATC-1", "AAACCCAAGGCCTGCT-12"), "Celltype", drop = FALSE] Celltype
AAACCCAAGGGCAATC-1 B cell, IgG+
AAACCCAAGGCCTGCT-12 NK cell
ct_b <- "AAACCCAAGGGCAATC-1"
ct_nk <- "AAACCCAAGGCCTGCT-12"
# Looking at all the scores for one specific cell
res %>% filter(condition %in% c(ct_b, ct_nk))# A tibble: 12 × 5
statistic source condition score p_value
<chr> <chr> <chr> <dbl> <dbl>
1 ulm B cells AAACCCAAGGCCTGCT-12 1.00 3.15e- 1
2 ulm B cells AAACCCAAGGGCAATC-1 16.4 4.37e-60
3 ulm T cells AAACCCAAGGCCTGCT-12 0.722 4.70e- 1
4 ulm T cells AAACCCAAGGGCAATC-1 -0.175 8.61e- 1
5 ulm Naive T cells AAACCCAAGGCCTGCT-12 0.278 7.81e- 1
6 ulm Naive T cells AAACCCAAGGGCAATC-1 1.01 3.10e- 1
7 ulm CD8 Cytotoxic AAACCCAAGGCCTGCT-12 3.61 3.13e- 4
8 ulm CD8 Cytotoxic AAACCCAAGGGCAATC-1 -0.433 6.65e- 1
9 ulm Monocytes AAACCCAAGGCCTGCT-12 1.19 2.33e- 1
10 ulm Monocytes AAACCCAAGGGCAATC-1 1.41 1.60e- 1
11 ulm NKs AAACCCAAGGCCTGCT-12 4.71 2.49e- 6
12 ulm NKs AAACCCAAGGGCAATC-1 -0.711 4.77e- 1
We can see how this cell had been annotated as a B cell and when we look at the scores the B cell signature has the highest score with a significant p value.
How does a univariate linear model work?
Lets start with a toy example. Imagine a very simple scenario where we have two very simple vectors where one is double the other. We can compute the linear model and also easily visualize the relationship between both vectors:
# Define vectors of interest
vec1 <- c(1, 2, 5)
vec2 <- c(2.1, 3.8, 9.7)
# Run the linear model
summary(lm(vec2 ~ vec1))
Call:
lm(formula = vec2 ~ vec1)
Residuals:
1 2 3
0.09231 -0.12308 0.03077
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.09231 0.16853 0.548 0.6810
vec1 1.91538 0.05329 35.940 0.0177 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1569 on 1 degrees of freedom
Multiple R-squared: 0.9992, Adjusted R-squared: 0.9985
F-statistic: 1292 on 1 and 1 DF, p-value: 0.01771
# Visualize the data
(p <- ggplot(mapping = aes(x = vec1, y = vec2)) +
geom_point() +
geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
coord_fixed() +
xlim(0, 10) +
ylim(0, 10) +
theme_minimal())# now we can add the slope of the line that best fits our data and the T value
p +
stat_poly_line(formula = y ~ x, se = FALSE) +
stat_poly_eq(use_label(c("eq"))) +
stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)In the example above we see the linear relationship between both vectors and we get the slope and the T value:
- The slope indicates the what is the change in the response variable (vec2) given a 1 unit change in the predictor variable (vec1).
- The T statistic is the result of a T test. The T test assesses the significance of individual coefficients in our model. The T value indicates the number of standard errors the estimated coefficient is away from the null hypothesis (t = 0). Remember the T value is the \(\frac{coefficient}{standard~error}\).
Now lets look at a “real world” example, we want to score the B cell signature in one cell. First we are going to start by visualizing the relationship between the weights and the gene expression for 2 cells of interest, one is a B cell and the other is not.
We need to do a bit of data prep but bear with me!
# We have our gene expression matrix
mat <- se@assays$RNA$data
# We want to obtain a matrix with 1s and 0s indicating the weight each gene has for each signature
## Initialize mor_mat with all 0s
sources <- unique(sig_df$signature)
targets <- rownames(mat)
mor_mat <- matrix(0, ncol = length(sources), nrow = nrow(mat))
colnames(mor_mat) <- sources
rownames(mor_mat) <- targets
weights <- sig_df$mor
# Fill in the matrix with the weights in the right places
for (i in 1:nrow(sig_df)) {
.source <- sig_df$signature[[i]]
.target <- sig_df$gene[[i]]
.weight <- weights[[i]]
if (.target %in% targets) {
mor_mat[[.target, .source]] <- .weight
}
}# labels for geom_text_repel
repel_text <- rownames(mat)
keep <- which(rownames(mat) %in% bcell)
# Set non-selected positions to NA
repel_text[-keep] <- NA
# Visualize the data
ggplot(mapping = aes(x = mat[, ct_b], y = mor_mat[, "B cells", drop = FALSE])) +
geom_point() +
ggrepel::geom_text_repel(aes(label = repel_text)) +
geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
labs(x = "Gene Expression", y = "Gene Weight", title = glue::glue("Cell {ct_b} - B cell")) +
coord_fixed() +
xlim(0, 10) +
ylim(0, 5) +
theme_minimal() +
# now we can add the slope of the line that best fits our data and the T value
stat_poly_line(formula = x ~ y, se = FALSE) +
stat_poly_eq(use_label(c("eq"))) +
stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)# Visualize the data
ggplot(mapping = aes(x = mat[, ct_nk], y = mor_mat[, "B cells", drop = FALSE])) +
geom_point() +
ggrepel::geom_text_repel(aes(label = repel_text)) +
geom_abline(slope = 1, color = "red", linetype = "dashed") + # Passing a slope = 1
labs(x = "Gene Expression", y = "Gene Weight", title = glue::glue("Cell {ct_nk} - Not B cell")) +
coord_fixed() +
xlim(0, 7.5) +
ylim(0, 2.5) +
theme_minimal() +
# now we can add the slope of the line that best fits our data and the T value
stat_poly_line(formula = x ~ y, se = FALSE) +
stat_poly_eq(use_label(c("eq"))) +
stat_correlation(use_label(c("t")), label.x = 0.05, label.y = 0.9)Next we are going to manually run the models for these two cells so that we can see that the results obtained from decoupleR make sense!
Check that the mor_mat has the weights in the right places
bcell[1] "MS4A1" "CD79A" "CD79B" "BANK1" "HLA-DQB1" "HLA-DQA1"
mor_mat["MS4A1", ] # We can see how MS4A1 only has a weight in the B cell signature! B cells T cells Naive T cells CD8 Cytotoxic Monocytes NKs
1 0 0 0 0 0
mor_mat["CD3D", ] # We can see how CD3D has weights in all T cell signatures! B cells T cells Naive T cells CD8 Cytotoxic Monocytes NKs
0 1 1 1 0 0
Lets run the a linear model to score two cell for the B cell signature
mod1 <- lm(as.matrix(mat[, ct_b, drop = FALSE]) ~ mor_mat[, "B cells", drop = FALSE])
summary(mod1)
Call:
lm(formula = as.matrix(mat[, ct_b, drop = FALSE]) ~ mor_mat[,
"B cells", drop = FALSE])
Residuals:
Min 1Q Median 3Q Max
-0.3411 -0.0814 -0.0814 -0.0814 6.1124
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.081443 0.001778 45.80 <2e-16 ***
mor_mat[, "B cells", drop = FALSE] 2.167981 0.132333 16.38 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3241 on 33232 degrees of freedom
Multiple R-squared: 0.008012, Adjusted R-squared: 0.007982
F-statistic: 268.4 on 1 and 33232 DF, p-value: < 2.2e-16
2.167981 / 0.132333 # This equals the T value[1] 16.38277
mod2 <- lm(as.matrix(mat[, ct_nk, drop = FALSE]) ~ mor_mat[, "B cells", drop = FALSE])
summary(mod2)
Call:
lm(formula = as.matrix(mat[, ct_nk, drop = FALSE]) ~ mor_mat[,
"B cells", drop = FALSE])
Residuals:
Min 1Q Median 3Q Max
-0.2301 -0.0775 -0.0775 -0.0775 6.9463
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.077458 0.002042 37.928 <2e-16 ***
mor_mat[, "B cells", drop = FALSE] 0.152596 0.151992 1.004 0.315
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3723 on 33232 degrees of freedom
Multiple R-squared: 3.033e-05, Adjusted R-squared: 2.397e-07
F-statistic: 1.008 on 1 and 33232 DF, p-value: 0.3154
0.152596 / 0.151992 # This equals the T value[1] 1.003974
We can see how mod1 has returned a high coefficient for cell TTTGCATGAGAGGC while mod2 has returned a low coefficient for cell AAGATGGAGATAAG. Moreover, when we look at the T value for the B cell signature we see how in mod1 it is 8.881 while for mod2 it is 0.697.
Lets check that the T values obtained manually actually match those returned by decoupleR
res %>% filter(source == "B cells" & condition %in% c(ct_b, ct_nk))# A tibble: 2 × 5
statistic source condition score p_value
<chr> <chr> <chr> <dbl> <dbl>
1 ulm B cells AAACCCAAGGCCTGCT-12 1.00 3.15e- 1
2 ulm B cells AAACCCAAGGGCAATC-1 16.4 4.37e-60
Effectively, they do!
Visualization
We can directly add the ulm scores to an assay in our object and visualize the results
se[['signatureulm']] <- res %>%
pivot_wider(
id_cols = 'source',
names_from = 'condition',
values_from = 'score') %>%
column_to_rownames('source') %>%
Seurat::CreateAssayObject(.)
# Change assay
DefaultAssay(object = se) <- "signatureulm"
# Scale the data for comparison across signatures
se <- ScaleData(se)
se@assays$signatureulm@data <- se@assays$signatureulm@scale.dataUMAP visualization
Plot all the gene signatures one after the other
plt <- FeaturePlot(
se,
features = rownames(se@assays$signatureulm),
ncol = 3) &
scale_color_viridis_c(option = "magma")
plt + dim_pltHeatmap by groups
We can also visualize the gene signature scores for each individual cell using a heatmap
DoHeatmap(
se,
features = rownames(se@assays$signatureulm),
group.by = "Celltype",
slot = "data",
group.colors = RColorBrewer::brewer.pal(n = 7, name = "Dark2")) +
scale_fill_viridis_c(option = "viridis")From the plot above we can see how we have very distinct populations in our datasets. We can also look at it a bit less granular by looking at the mean activity score per cluster.
# Extract activities from object as a long dataframe
df <- t(as.matrix(se@assays$signatureulm@data)) %>%
as.data.frame() %>%
mutate(cluster = se$Celltype) %>%
pivot_longer(cols = -cluster, names_to = "source", values_to = "score") %>%
group_by(cluster, source) %>%
summarise(mean = mean(score))
# Transform to wide matrix
top_acts_mat <- df %>%
pivot_wider(id_cols = 'cluster', names_from = 'source',
values_from = 'mean') %>%
column_to_rownames('cluster') %>%
as.matrix()
# Choose color palette
palette_length = 100
my_color = colorRampPalette(c("#00008B", "white","red"))(palette_length)
# Show which is the max and min of the scaled value to make sure we set a scale that makes sense
glue::glue("Note that the maximum scaled value is: {round(max(top_acts_mat), 2)}, and the minimum is {round(min(top_acts_mat), 2)}.")Note that the maximum scaled value is: 2.66, and the minimum is -0.87.
my_breaks <- c(seq(quantile(top_acts_mat, .01), 0, length.out=ceiling(palette_length/2) + 1),
seq(0.05, quantile(top_acts_mat, .99), length.out=floor(palette_length/2)))
# Plot
ComplexHeatmap::pheatmap(top_acts_mat, border_color = NA, color=my_color, breaks = my_breaks)Heatmap for gene expression
To fully grasp which genes are driving each gene signature within each cell we want to visualize the gene expression of the genes involved in each gene signature for each cell. We can do so using the ComplexHeatmap package and a little bit of data processing. For ease here is a function you can incorporate in your analysis:
geneHM <- function(
object,
sig_df,
sig_name,
sig_assay,
.source,
.target,
sig_slot = "data",
expr_assay = "RNA",
expr_slot = "data",
grouping = NULL,
grouping_color = NULL,
expr_cols = viridisLite::magma(100)) {
# Extract Gene Expression Matrix from Seurat Object
gene_expr <- GetAssayData(object, assay = expr_assay, layer = expr_slot)
# Subset the genes of the signature from the Gene Expression Matrix
genes_of_interest <- sig_df[, .target][which(sig_df[, .source] %in% sig_name)]
# Subset the genes intersecting between gene expression and genes in signature
g_int <- intersect(rownames(gene_expr), genes_of_interest)
if (length(g_int) < length(genes_of_interest)) {
genes_excluded <- genes_of_interest[!genes_of_interest %in% rownames(gene_expr)]
genes_excluded <- paste(genes_excluded, collapse = ", ")
message(paste0(
"Genes ", genes_excluded,
" are in the gene signature but not in the expression matrix,",
" therefore, they have been excluded."))
}
# Subset expression matrix to only genes of interest
gene_expr <- gene_expr[g_int, ]
# Extract the Scores of the Signature of interest
sig_score <- GetAssayData(object, assay = sig_assay, layer = sig_slot)
sig_vec <- sig_score[sig_name, ]
anno <- data.frame(score = sig_vec)
# Make sure they are in the right order
anno <- anno[colnames(gene_expr), , drop = FALSE]
# Add any metadata if specified
if (!is.null(grouping)) {
meta <- object@meta.data[, grouping, drop = FALSE]
anno <- cbind(anno, meta[rownames(anno), , drop = FALSE])
}
if (any(is.infinite(c(anno$score))))
stop("There are scores with Inf values, please address this outside of this function. It could be because the slot used is scale_data.")
# Make list of color to paint the annotation columns
if (!is.null(grouping) & !is.null(grouping_color)) {
score <- circlize::colorRamp2(
breaks = c(quantile(anno$score, 0.01), 0, quantile(anno$score, 0.99)),
colors = c("blue", "white", "red"))
color_ls <- append(grouping_color, score)
names(color_ls)[length(color_ls)] <- "score"
} else {
color_ls <- list(
score = circlize::colorRamp2
(breaks = c(min(anno$score), 0, max(anno$score)),
colors = c("blue", "white", "red")),
annot = clust_color)
}
# Set the order from most expressing to least expressing
ord <- rownames(anno[order(anno$score, decreasing = TRUE), ])
# Add the score of the signature as annotation in the heatmap
colAnn <- HeatmapAnnotation(
df = anno[ord, , drop = FALSE],
which = 'column',
col = color_ls
)
# Visualize the Heatmap with the genes and signature
ht <- Heatmap(
as.matrix(gene_expr[, ord]),
name = "Gene Expression",
col = expr_cols,
cluster_rows = TRUE,
cluster_columns = TRUE,
column_title = sig_name,
column_names_gp = gpar(fontsize = 14),
show_column_names = FALSE,
top_annotation = colAnn)
# Return ComplexHeatmap
draw(ht)
}
# Visualize the heatmaps for all signatures
# tt <- lapply(unique(sig_df$signature), function(i) {
# geneHM(
# object = se,
# sig_df = sig_df,
# .source = "signature",
# .target = "gene",
# sig_name = i,
# expr_slot = "data",
# expr_assay = "RNA",
# sig_assay = "signatureulm",
# sig_slot = "data",
# grouping = c("RNA_snn_res.0.15"),
# grouping_color = list(RNA_snn_res.0.15 = clust_color))
# })Here are some examples of how to interpret these gene signatures:
- In the Monocyte signature not all cells that have the same score have the same genes expressed. For example, wee can see how among the cells with high scores there are cells that express CD14 with a gradient switching to expression of FCGR3A. Therefore, classifying all of these cells as the same just because the gene signature scoring returns the same value would be a mistake. In this case, a likely scenario could be that the gene signature isn’t teasing out differences between classical, intermediate, and non-classical monocytes.
# Subset to 10% of the original dataset for speed and visualization purposes
se_sub <- se[, sample(colnames(se), 0.1 * ncol(se))]
geneHM(
object = se_sub,
sig_df = sig_df,
.source = "signature",
.target = "gene",
sig_name = "Monocytes",
expr_slot = "data",
expr_assay = "RNA",
sig_assay = "signatureulm",
sig_slot = "data",
grouping = c("Celltype"),
grouping_color = list(Celltype = pal))- When looking at the CD8 Cytotoxic compartment we also observe how NK cells have a high score despite not expressing CD3 genes. This can be due to the simulatenous high expression of NKG7 and CCL5 in NKs and CD8 T cells.
geneHM(
object = se_sub,
sig_df = sig_df,
.source = "signature",
.target = "gene",
sig_name = "CD8 Cytotoxic",
expr_slot = "data",
expr_assay = "RNA",
sig_assay = "signatureulm",
sig_slot = "data",
grouping = c("Celltype"),
grouping_color = list(Celltype = pal))- This is more of a dummy example but when assessing the T cell signature
c("CD3D", "CD3E", "TRAC", "TRBC1", "TRBC2", "CD4", "CD8A", "CD8B")CD8 T cells have a higher score than CD4 T cells. This is due to there being two CD8 genes (A & B) as well as CD8B being expressed at higher levels than CD4. Therefore, we need to keep an eye on these things to better interpret the heterogeneity within our populations.
geneHM(
object = se_sub,
sig_df = sig_df,
.source = "signature",
.target = "gene",
sig_name = "T cells",
expr_slot = "data",
expr_assay = "RNA",
sig_assay = "signatureulm",
sig_slot = "data",
grouping = c("Celltype"),
grouping_color = list(Celltype = pal))In summary, when scoring gene signatures and looking at their activities it is important to not only look at the oveall score obtained for each cell but one also needs to dive deeper into which are the genes that are driving that signature!
Extra!
In the above example we showed how to compute signature scores using ulm, if we take a closer look to the original decoupleR paper (badia-i-mompel 2022) we can see how in Supplementary Figure 3 ulm and mlm slightly outperform norm_wmean in terms of AUROC and AUPRC. Moreover, in the Bioconductor Vignettes they showcase the use of run_wmean instead of ulm. So… why use ulm instead of norm_wmean?
Run norm_wmean
In this section we are going to show how computing the normalized weighted mean with 1,000 permutations provides a very similar result to the ulm but takes much longer!.
Run decouple on our data using the wmean method. As mentioned in the details of the function: “WMEAN infers regulator activities by first multiplying each target feature by its associated weight which then are summed to an enrichment score wmean. Furthermore, permutations of random target features can be performed to obtain a null distribution that can be used to compute a z-score norm_wmean, or a corrected estimate corr_wmean by multiplying wmean by the minus log10 of the obtained empirical p-value.”.
wmean_start <- Sys.time()
res_wmean <- decoupleR::run_wmean(
mat = se@assays$RNA$data,
network = sig_df,
.source = "signature",
.target = "gene",
.mor = "mor",
times = 1000)
glue::glue("Time to run norm_wmean with 1000 iterations is {round(difftime(Sys.time(), wmean_start, units = 's'), 0)} seconds")
res_wmeanWe obtain a long format tibble containing:
statistic - Indicating which method is associated with each score
wmean: multiplying each target feature by its associated weight which then are summed to an enrichment score
norm_wmean: permutations of random target features can be performed to obtain a null distribution that can be used to compute a z-score
norm_wmeancorr_wmean: corrected estimate by multiplying
wmeanby the minus log10 of the obtained empirical p-value
source (aka - signature name)
condition - cell barcode
score - the signature score, the inferred biological activity.
p_value - P value obtained from permutations
Compare ulm with norm_wmean scores
res2 <- res %>%
dplyr::select(statistic, source, score, condition)
colnames(res2) <- glue::glue("{colnames(res2)}_ulm")
res_wmean2 <- res_wmean %>%
dplyr::filter(statistic == "norm_wmean") %>%
dplyr::select(statistic, source, score, condition)
colnames(res_wmean2) <- glue::glue("{colnames(res_wmean2)}_wmean")
res2 %>%
left_join(
res_wmean2,
by = c("condition_ulm" = "condition_wmean", "source_ulm" = "source_wmean")) %>%
ggplot(aes(x = score_ulm, y = score_wmean)) +
geom_point() +
facet_wrap(~source_ulm, scales = "free") +
stat_smooth(method = "lm", formula = y ~ x, geom = "smooth") +
ggpubr::stat_cor(method = "pearson") +
labs(x = "ulm", y = "norm_wmean") +
theme_minimal()Session Info
sessionInfo()R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggpmisc_0.5.4-1 ggpp_0.5.4 colorBlindness_0.1.9 ComplexHeatmap_2.16.0 decoupleR_2.9.1 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2 readr_2.1.4 tidyr_1.3.1 tibble_3.2.1 ggplot2_3.5.0 tidyverse_2.0.0 Seurat_5.0.2 SeuratObject_5.0.1 sp_2.1-3
loaded via a namespace (and not attached):
[1] RcppAnnoy_0.0.22 splines_4.3.1 later_1.3.2 polyclip_1.10-6 confintr_1.0.2 fastDummies_1.7.3 lifecycle_1.0.4 doParallel_1.0.17 globals_0.16.2 lattice_0.21-8 vroom_1.6.3 MASS_7.3-60 magrittr_2.0.3 plotly_4.10.4 rmarkdown_2.25 yaml_2.3.8 remotes_2.4.2.1 httpuv_1.6.14 sctransform_0.4.1 spam_2.10-0 spatstat.sparse_3.0-3 reticulate_1.35.0.9000 cowplot_1.1.3 pbapply_1.7-2 RColorBrewer_1.1-3 abind_1.4-5 Rtsne_0.17 BiocGenerics_0.46.0 circlize_0.4.15 IRanges_2.34.1 S4Vectors_0.38.1 ggrepel_0.9.5 irlba_2.3.5.1 listenv_0.9.1 spatstat.utils_3.0-4 MatrixModels_0.5-2 goftest_1.2-3 RSpectra_0.16-1 spatstat.random_3.2-2 fitdistrplus_1.1-11 parallelly_1.37.0 leiden_0.4.3.1 codetools_0.2-19 tidyselect_1.2.0 shape_1.4.6 farver_2.1.1 matrixStats_1.2.0 stats4_4.3.1 spatstat.explore_3.2-6 jsonlite_1.8.8 GetoptLong_1.0.5
[52] ellipsis_0.3.2 progressr_0.14.0 ggridges_0.5.6 survival_3.5-7 iterators_1.0.14 foreach_1.5.2 tools_4.3.1 ica_1.0-3 Rcpp_1.0.12 glue_1.7.0 gridExtra_2.3 xfun_0.42 withr_3.0.0 BiocManager_1.30.22 fastmap_1.1.1 fansi_1.0.6 SparseM_1.81 digest_0.6.34 timechange_0.2.0 R6_2.5.1 mime_0.12 gridGraphics_0.5-1 colorspace_2.1-0 Cairo_1.6-1 scattermore_1.2 tensor_1.5 spatstat.data_3.0-4 utf8_1.2.4 generics_0.1.3 data.table_1.15.0 httr_1.4.7 htmlwidgets_1.6.4 uwot_0.1.16 pkgconfig_2.0.3 gtable_0.3.4 lmtest_0.9-40 htmltools_0.5.7 dotCall64_1.1-1 clue_0.3-64 scales_1.3.0 png_0.1-8 knitr_1.45 rstudioapi_0.15.0 tzdb_0.4.0 reshape2_1.4.4 rjson_0.2.21 nlme_3.1-163 zoo_1.8-12 GlobalOptions_0.1.2 KernSmooth_2.23-22 parallel_4.3.1
[103] miniUI_0.1.1.1 pillar_1.9.0 vctrs_0.6.5 RANN_2.6.1 promises_1.2.1 xtable_1.8-4 cluster_2.1.4 evaluate_0.23 magick_2.7.5 cli_3.6.2 compiler_4.3.1 rlang_1.1.3 crayon_1.5.2 future.apply_1.11.1 labeling_0.4.3 plyr_1.8.9 stringi_1.8.3 viridisLite_0.4.2 deldir_2.0-2 munsell_0.5.0 lazyeval_0.2.2 spatstat.geom_3.2-8 quantreg_5.97 Matrix_1.6-5 RcppHNSW_0.6.0 hms_1.1.3 patchwork_1.2.0 bit64_4.0.5 future_1.33.1 shiny_1.8.0 ROCR_1.0-11 igraph_2.0.2 bit_4.0.5 polynom_1.4-1